Concept-match medical data scrubbing. How pathology text can be used in research.

Author

  • Jules J Berman
Abstract

CONTEXT In the normal course of activity, pathologists create and archive immense data sets of scientifically valuable information. Researchers need pathology-based data sets, annotated with clinical information and linked to archived tissues, to discover and validate new diagnostic tests and therapies. Pathology records can be used for research purposes (without obtaining informed patient consent for each use of each record), provided the data are rendered harmless. Large data sets can be made harmless through 3 computational steps: (1) deidentification, the removal or modification of data fields that can be used to identify a patient (name, social security number, etc); (2) rendering the data ambiguous, ensuring that every data record in a public data set has a nonunique set of characterizing data; and (3) data scrubbing, the removal or transformation of words in free text that can be used to identify persons or that contain information that is incriminating or otherwise private. This article addresses the problem of data scrubbing.

OBJECTIVE To design and implement a general algorithm that scrubs pathology free text, removing all identifying or private information.

METHODS The Concept-Match algorithm steps through confidential text. When a medical term matching a standard nomenclature term is encountered, the term is replaced by a nomenclature code and a synonym for the original term. When a high-frequency "stop" word, such as a, an, the, or for, is encountered, it is left in place. When any other word is encountered, it is blocked and replaced by asterisks. This produces a scrubbed text. An open-source implementation of the algorithm is freely available.

RESULTS The Concept-Match scrub method transformed pathology free text into scrubbed output that preserved the sense of the original sentences, while it blocked terms that did not match terms found in the Unified Medical Language System (UMLS). The scrubbed product is safe, in the restricted sense that the output retains only standard medical terms. The software implementation scrubbed more than half a million surgical pathology report phrases in less than an hour.

CONCLUSIONS Computerized scrubbing can render the textual portion of a pathology report harmless for research purposes. Scrubbing and deidentification methods allow pathologists to create and use large pathology databases to conduct medical research.
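The three rules in the METHODS description (replace nomenclature terms, keep stop words, block everything else) can be illustrated with a small Python sketch. This is not the author's open-source implementation. The nomenclature table, its codes, the stop-word list, and the function name `concept_match_scrub` are all hypothetical stand-ins; a real run would load terms and codes from UMLS. The sketch also assumes longest-phrase-first matching and a single asterisk per blocked word, neither of which is specified in the abstract.

```python
import re

# Hypothetical toy nomenclature: term -> (code, preferred synonym).
# The codes are placeholders for illustration, not real UMLS identifiers.
NOMENCLATURE = {
    "adenocarcinoma": ("C0000001", "adenocarcinoma"),
    "large bowel": ("C0000002", "colon"),
    "lymph node": ("C0000003", "lymph node"),
}

# Assumed short list of high-frequency stop words that are left in place.
STOP_WORDS = {"a", "an", "the", "for", "of", "in", "and", "with"}

BLOCK = "*"  # assumption: one asterisk stands in for each blocked word

# Longest nomenclature term, in words, bounds the phrase window.
MAX_TERM_WORDS = max(len(term.split()) for term in NOMENCLATURE)


def concept_match_scrub(text):
    """Return a scrubbed copy of text using the three Concept-Match rules."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    i = 0
    while i < len(words):
        # Rule 1: try the longest nomenclature phrase starting at word i.
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in NOMENCLATURE:
                code, synonym = NOMENCLATURE[phrase]
                out.append(f"{synonym} [{code}]")
                i += n
                break
        else:
            # Rule 2: keep stop words. Rule 3: block any other word.
            out.append(words[i] if words[i] in STOP_WORDS else BLOCK)
            i += 1
    return " ".join(out)


print(concept_match_scrub("Mr. Smith has an adenocarcinoma of the large bowel"))
# prints: * * * an adenocarcinoma [C0000001] of the colon [C0000002]
```

Because the output is built only from nomenclature entries, stop words, and the blocking symbol, any word outside those two lists (such as a patient name) cannot survive scrubbing. This is the restricted sense of "safe" described in the RESULTS.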

Similar resources

Preparing Clinical Text for Use in Biomedical Research

Approximately 57 different types of clinical annotations construct a patient’s medical record. These annotations include radiology reports, discharge summaries, and surgical and nursing notes. Hospitals typically produce millions of text-based medical records over the course of a year. These records are essential for the delivery of care, but many are underutilized or not utilized at all for cl...

Nature of information literacy in elementary schools: case study of Persian literature in fourth grade

Background and Aim: Information literacy is a contextual concept that needs to be studied in different contexts like schools. Promoting reading literacy is a core instructional objective of the Persian literature curriculum and also a part of information literacy. Understanding the concept of information literacy helps us to understand information literacy in elementary schools and can implement it in...

Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies.

BACKGROUND De-identification and anonymization are strategies that are used to remove patient identifiers in electronic health record data. The use of these strategies in multicenter research studies is of paramount importance, given the need to share electronic health record data across multiple environments and institutions while safeguarding patient privacy. METHODS Systematic literature s...

Data Mining Medication Prescriptions for a Representative National Sample

It is the purpose of this paper to examine how medications are prescribed to individuals in combination for one or multiple medical conditions, and to explore the use of such medications. The Agency for Healthcare Research and Quality yearly conducts the Medical Expenditure Panel Survey and makes the results available for research purposes. The latest survey released in February, 2004 was for t...

Identifying UMLS Concepts

Objective: The objective of this pilot project was to apply and evaluate methods for processing Emergency Department (ED) Chief Complaint (CC) terms, in order to identify the concepts that comprise the ED CC domain. Materials and Methods: A corpus of CC data was collected from three EDs representing urban, rural and suburban academic medical centers. For the pilot project, the corpus included a...

Journal title:
  • Archives of Pathology & Laboratory Medicine

Volume 127, Issue 6

Pages: -

Publication date: 2003